Corpus-based Sinhala Lexicon

Authors

  • Ruvan Weerasinghe
  • Dulip Lakmal Herath
  • Viraj Welgama
Abstract

A lexicon is an important resource in any kind of language processing application. Corpus-based lexica have several advantages over traditional approaches. The lexicon developed for Sinhala was based on text obtained from a corpus of 10 million words drawn from diverse genres. The words extracted from the corpus have been labeled with part-of-speech categories defined according to a novel classification proposed for Sinhala. The lexicon reports 80% coverage over unrestricted text obtained from online sources. The lexicon has been implemented in the Lexical Markup Framework.

Similar resources

A Rule Based Syllabification Algorithm for Sinhala

This paper presents a study of Sinhala syllable structure and an algorithm for identifying syllables in Sinhala words. After a thorough study of the syllable structure and linguistic rules for syllabification of Sinhala words and a survey of the relevant literature, a set of rules was identified and implemented as a simple, easy-to-implement algorithm. The algorithm was tested using 30,000 dist...

Lexicon and hidden Markov model-based optimisation of the recognised Sinhala script

The Brahmi-descended Sinhala script is used by 75% of the 18 million population of Sri Lanka. To the best of our knowledge, none of the Brahmi-descended scripts used by hundreds of millions of people in South Asia has a commercial OCR product. In the process of implementation of an OCR system for the printed Sinhala script which is easily adaptable to similar scripts [Premaratne, L., Assabi...

Dialogue Act Recognition for Text-based Sinhala

This paper discusses the application of classical machine learning approaches to the task of Dialogue Act Recognition for text-based Sinhala. A study was carried out to identify a dialogue act tag set for Sinhala. A new corpus using Sinhala subtitles for English movies was created and was annotated with the selected dialogue acts. Evaluation of the dialogue act recognition system was performed ...

Hidden Markov Model Based Part of Speech Tagger for Sinhala Language

In this paper we present the fundamental lexical semantics of the Sinhala language and a Hidden Markov Model (HMM) based part-of-speech (POS) tagger for Sinhala. In any natural language processing task, part of speech is a vital topic; it involves analysing the construction, behaviour and dynamics of the language, knowledge of which could be utilized in computational linguistics...

Sinhala Grapheme-to-Phoneme Conversion and Rules for Schwa Epenthesis

This paper describes an architecture to convert Sinhala Unicode text into a phonemic specification of pronunciation. The study focused mainly on disambiguating schwa /ə/ and /a/ vowel epenthesis for consonants, which is one of the significant problems found in Sinhala. This problem has been addressed by formulating a set of rules. The proposed set of rules was tested using 30,000 distinct wor...

Publication date: 2009